community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers #29063

pprados · 2025-01-07T08:46:24Z

Adds BlobParsers for images. These implementations take an image and produce one or more documents per image. This interface can be used to expose OCR capabilities.
Updates PyMuPDFParser and the loader to standardize metadata, handle images, and improve table extraction, among other changes.
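
To illustrate the idea of an image blob parser that yields one or more documents per image, here is a minimal sketch. `Blob`, `Document`, `ImageBlobParser`, `FakeOcrParser`, and the `region` metadata key are hypothetical stand-ins written for this example, not the classes added by this PR; a real backend would call an OCR or captioning engine instead of decoding text.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Stand-in for a raw-data container (hypothetical, not the LangChain class)."""

    data: bytes
    source: str | None = None


@dataclass
class Document:
    """Stand-in for a parsed document (hypothetical)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class ImageBlobParser(ABC):
    """Turn one image blob into one or more documents."""

    @abstractmethod
    def _analyze_image(self, data: bytes) -> list[str]:
        """Return one text per detected region (OCR, captioning, ...)."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # One document per region, each tagged with its origin and position.
        for index, text in enumerate(self._analyze_image(blob.data)):
            yield Document(
                page_content=text,
                metadata={"source": blob.source, "region": index},
            )


class FakeOcrParser(ImageBlobParser):
    """Toy backend: treats each non-empty line of the bytes as a recognized region."""

    def _analyze_image(self, data: bytes) -> list[str]:
        return [line for line in data.decode("utf-8").splitlines() if line.strip()]
```

Keeping the backend behind a single `_analyze_image` hook means that, under these assumptions, different OCR engines only need to override one method.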

Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This part focuses on preparing the update of all parsers.

For more details, see PR 28970.

vercel · 2025-01-07T08:46:28Z

The latest updates on your projects.

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 20, 2025 4:17pm

pprados · 2025-01-07T16:19:05Z

@eyurtsev I rebased the code onto master ;-)

eyurtsev

Great, will take a look in the AM.

eyurtsev

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

1. Define the standardized structure of metadata
2. Create a dedicated ImageParser which is a blob parser
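
As a rough illustration of the first point, a standardized metadata structure could be enforced by a small normalization step. The key set and the `standardize_metadata` helper below are assumptions made for this sketch; the discussion here does not list the keys that were finally agreed on.

```python
from datetime import datetime

# Hypothetical standard keys; the PR discussion does not list the agreed set.
STANDARD_KEYS = ("source", "page", "total_pages", "title", "author", "creationdate")


def standardize_metadata(raw: dict) -> dict:
    """Lower-case keys, keep only standard ones, drop empty values,
    and render datetimes as ISO 8601 strings."""
    lowered = {str(key).lower(): value for key, value in raw.items()}
    result = {}
    for key in STANDARD_KEYS:
        value = lowered.get(key)
        if value is None or value == "":
            continue  # Skip missing or empty fields instead of emitting blanks.
        if isinstance(value, datetime):
            value = value.isoformat()
        result[key] = value
    return result
```

With a single function like this, every parser could emit the same keys regardless of which PDF library produced the raw values.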

libs/community/langchain_community/document_loaders/parsers/pdf.py

pprados · 2025-01-17T13:24:44Z

yum is deprecated and has been replaced by dnf.
But in doc/Makefile, yum is still used.
I cannot install yum on Ubuntu.
That makes it difficult for me to fix a bug in the documentation.

libs/community/langchain_community/document_loaders/parsers/images.py

vercel bot deployed to Preview January 7, 2025 08:55 View deployment

vercel bot deployed to Preview January 7, 2025 09:15 View deployment

pprados marked this pull request as ready for review January 7, 2025 09:16

dosubot bot added the size:XXL, community, and Ɑ: doc loader labels Jan 7, 2025

ccurme assigned eyurtsev Jan 7, 2025

pprados added 7 commits January 7, 2025 17:08

Prepare the integration of new versions of PDFLoader.

21759e2

Add file_path with PurePath
Add CloudBlobLoader in __init__
Replace Dict/List with dict/list

Fix Line too long

4607354

Fix Line too long

668dc9c

Fix Line too long

7a5b5c5

Fix Line too long

6340ded

Update PyMuPDF

4845781

Fix tu

3beda82

pprados force-pushed the pprados/02-pymupdf branch from 039819c to 3beda82 Compare January 7, 2025 16:09

vercel bot deployed to Preview January 7, 2025 16:18 View deployment

eyurtsev reviewed Jan 8, 2025

pprados mentioned this pull request Jan 8, 2025

Refactoring PDF loaders: all #28970

Draft

2 tasks

eyurtsev reviewed Jan 9, 2025

pprados added 3 commits January 9, 2025 16:48

Fix review - step 1

743a83e

Fix all remarks

b623750

Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf

20f5a41

pprados marked this pull request as draft January 10, 2025 12:45

vercel bot deployed to Preview January 10, 2025 13:30 View deployment

pprados force-pushed the pprados/02-pymupdf branch from 0d99673 to 3fe4ec5 Compare January 10, 2025 13:40

vercel bot deployed to Preview January 10, 2025 13:49 View deployment

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 4342991 to 760267b Compare January 10, 2025 14:05

vercel bot had a problem deploying to Preview January 17, 2025 13:10 Failure

pprados force-pushed the pprados/02-pymupdf branch from a9065ee to 0fe40a2 Compare January 17, 2025 13:32

vercel bot had a problem deploying to Preview January 17, 2025 13:38 Failure

pprados force-pushed the pprados/02-pymupdf branch from 0fe40a2 to 2c4b1f6 Compare January 17, 2025 13:49

Optimise tests

a4587f0

pprados force-pushed the pprados/02-pymupdf branch from 2c4b1f6 to a4587f0 Compare January 17, 2025 13:52

vercel bot deployed to Preview January 17, 2025 14:08 View deployment

Merge branch 'master' into pprados/02-pymupdf

d332958

vercel bot deployed to Preview January 17, 2025 14:19 View deployment

Merge branch 'master' into pprados/02-pymupdf

4b37b34

vercel bot deployed to Preview January 17, 2025 16:20 View deployment

Merge branch 'master' into pprados/02-pymupdf

2281d05

vercel bot deployed to Preview January 17, 2025 16:30 View deployment

eyurtsev reviewed Jan 17, 2025

libs/community/langchain_community/document_loaders/parsers/images.py (outdated, resolved)

pprados added 4 commits January 18, 2025 08:09

Remove Image.__init__

0da73f1

Merge remote-tracking branch 'origin/pprados/02-pymupdf' into pprados/02-pymupdf

d012d60

Merge branch 'master' into pprados/02-pymupdf

882c90d

Remove Image.__init__

74d3617

pprados force-pushed the pprados/02-pymupdf branch from f2fea1f to 74d3617 Compare January 18, 2025 07:25

vercel bot deployed to Preview January 18, 2025 07:35 View deployment

Merge branch 'master' into pprados/02-pymupdf

318f304

vercel bot deployed to Preview January 20, 2025 07:35 View deployment

Merge branch 'master' into pprados/02-pymupdf

5ee7b9c

vercel bot deployed to Preview January 20, 2025 16:17 View deployment

eyurtsev approved these changes Jan 20, 2025

dosubot bot added the lgtm label Jan 20, 2025

eyurtsev changed the title ~~Refactoring PDF loaders: 02 PyMuPDF~~ community[minor]: Refactoring PDF loaders: 02 PyMuPDF Jan 20, 2025

eyurtsev changed the title ~~community[minor]: Refactoring PDF loaders: 02 PyMuPDF~~ community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers Jan 20, 2025

eyurtsev merged commit 4efc509 into langchain-ai:master Jan 20, 2025
21 checks passed
